    Spoken content retrieval: A survey of techniques and technologies

    Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.

    Automatic Text Summarization of Newswire: Lessons Learned from the Document Understanding Conference

    Since 2001, the Document Understanding Conferences have been the forum for researchers in automatic text summarization to compare methods and results on common test sets. Over the years, several types of summarization tasks have been addressed: single-document summarization, multi-document summarization, summarization focused by question, and headline generation. This paper gives an overview of the results achieved in the different types of summarization tasks. We compare both the broader classes of summarizers (baselines, systems, and humans) and individual pairs of summarizers (both human and automatic). An analysis of variance model is fitted, with summarizer and input set as independent variables and the coverage score as the dependent variable, and simulation-based multiple comparisons are performed. The results document the progress of the field as a whole, rather than focusing on a single system, and thus can serve as a reference on the work done to date, as well as a starting point for the formulation of future tasks. Results also indicate that most progress in the field has been achieved in generic multi-document summarization and that the most challenging task is producing a focused summary in answer to a question/topic.
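
    The analysis set-up described above can be illustrated with a small sketch. The snippet below is an illustration, not the authors' code: it fits a two-way analysis of variance with summarizer and input set as factors and coverage as the response, then runs pairwise comparisons across summarizers. The synthetic data, column names, and the use of Tukey's HSD in place of the paper's simulation-based multiple comparisons are all assumptions.

        import numpy as np
        import pandas as pd
        import statsmodels.api as sm
        import statsmodels.formula.api as smf
        from statsmodels.stats.multicomp import pairwise_tukeyhsd

        # Synthetic coverage scores: one observation per (summarizer, input set) pair.
        rng = np.random.default_rng(0)
        summarizers = {"human_A": 0.15, "human_B": 0.14, "system_12": 0.05, "lead_baseline": 0.0}
        rows = [
            {"summarizer": name, "input_set": f"d{i:03d}",
             "coverage": 0.30 + offset + rng.normal(0.0, 0.05)}
            for name, offset in summarizers.items() for i in range(30)
        ]
        scores = pd.DataFrame(rows)

        # Two-way ANOVA: coverage ~ summarizer + input set.
        model = smf.ols("coverage ~ C(summarizer) + C(input_set)", data=scores).fit()
        print(sm.stats.anova_lm(model, typ=2))

        # Pairwise comparison of summarizers, corrected for multiple comparisons.
        print(pairwise_tukeyhsd(scores["coverage"], scores["summarizer"]))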

    Detecting (Un)Important Content for Single-Document News Summarization

    We present a robust approach for detecting intrinsic sentence importance in news, by training on two corpora of document-summary pairs. When used for single-document summarization, our approach, combined with the "beginning of document" heuristic, outperforms a state-of-the-art summarizer and the beginning-of-article baseline in both automatic and manual evaluations. These results represent an important advance because, in the absence of cross-document repetition, single-document summarizers for news have not been able to consistently outperform the strong beginning-of-article baseline.
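
    As a rough illustration of how a sentence-importance score and the lead heuristic can be combined at extraction time, the sketch below blends a per-sentence importance score with a position prior and keeps the top-scoring sentences. This is a sketch under assumed interfaces, not the paper's system; the stand-in importance function and the blending weight are inventions for the example.

        from typing import Callable, List

        def summarize(sentences: List[str],
                      importance: Callable[[str], float],
                      n_sentences: int = 2,
                      position_weight: float = 0.5) -> List[str]:
            scored = []
            for idx, sent in enumerate(sentences):
                position_score = 1.0 / (idx + 1)   # lead bias: earlier sentences score higher
                combined = (1.0 - position_weight) * importance(sent) + position_weight * position_score
                scored.append((combined, idx, sent))
            # Keep the top-scoring sentences, then restore document order.
            top = sorted(scored, reverse=True)[:n_sentences]
            return [sent for _, idx, sent in sorted(top, key=lambda t: t[1])]

        doc = [
            "A strong earthquake struck the coastal city early on Monday.",
            "Officials said power was briefly interrupted in several districts.",
            "The mayor, who took office last year, is fond of long speeches.",
            "No casualties have been reported so far.",
        ]
        # Stand-in importance scorer (sentence length); the paper instead trains
        # a model on document-summary pairs.
        print(summarize(doc, importance=lambda s: len(s.split()) / 15.0))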

    Automatic Detection of Contrastive Elements in Spontaneous Speech

    In natural speech, people use different levels of prominence to signal which parts of an utterance are especially important. Contrastive elements are often produced with stronger than usual prominence, and their presence modifies the meaning of the utterance in subtle but important ways. We use a richly annotated corpus of conversational speech to study the acoustic characteristics of contrastive elements and the differences between them and words at other levels of prominence. We report our results for automatic detection of contrastive elements based on acoustic and textual features, finding that a baseline that predicts nouns and adjectives as contrastive performs on par with the best combination of features. We achieve much better performance on a modified task of detecting contrastive elements among words that are predicted to bear pitch accent.
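
    The noun/adjective baseline mentioned above can be sketched in a few lines. The snippet is illustrative rather than the paper's code: given POS-tagged tokens, it marks every noun or adjective as a candidate contrastive element; the toy utterance and the assumption of Penn Treebank tags are mine.

        from typing import List, Tuple

        def baseline_contrastive(tagged: List[Tuple[str, str]]) -> List[Tuple[str, bool]]:
            # Penn Treebank tags: NN* = nouns, JJ* = adjectives; everything else
            # is predicted non-contrastive.
            return [(word, tag.startswith(("NN", "JJ"))) for word, tag in tagged]

        # Toy utterance, already tagged (in practice the tags would come from a POS tagger).
        utterance = [("I", "PRP"), ("wanted", "VBD"), ("the", "DT"),
                     ("red", "JJ"), ("car", "NN"), (",", ","),
                     ("not", "RB"), ("the", "DT"), ("blue", "JJ"), ("one", "NN")]
        print(baseline_contrastive(utterance))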

    Creating Local Coherence: An Empirical Assessment

    Two of the mechanisms for creating natural transitions between adjacent sentences in a text, resulting in local coherence, involve discourse relations and switches of the focus of attention between discourse entities. These two aspects of local coherence have traditionally been discussed and studied separately, but some empirical studies have given strong evidence that it is necessary to understand how the two types of coherence-creating devices interact. Here we present a joint corpus study of discourse relations and entity coherence exhibited in news texts from the Wall Street Journal and test several hypotheses expressed in earlier work about their interaction.

    Predicting the Fluency of Text with Shallow Structural Features: Case Studies of Machine Translation and Human-Written Text

    Sentence fluency is an important component of overall text readability, but few studies in natural language processing have sought to understand the factors that define it. We report the results of an initial study into the predictive power of surface syntactic statistics for the task; we use fluency assessments produced for the purpose of evaluating machine translation. We find that these features are weakly but significantly correlated with fluency. Machine and human translations can be distinguished with accuracy over 80%. Performance on pairwise comparison of fluency is also very high, over 90% for a multi-layer perceptron classifier. We also test the hypothesis that the learned models capture general fluency properties applicable to human-written text. The results do not support this hypothesis: prediction accuracy on the new data is only 57%. This finding suggests that developing a dedicated, task-independent corpus of fluency judgments will be beneficial for further investigations of the problem.
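
    A minimal sketch of the pairwise set-up described above is given below, assuming invented surface features and toy data (it is not the paper's feature set or model). Each training instance is the difference between the shallow feature vectors of two sentences, and the label indicates whether the first sentence is the more fluent one; a small multi-layer perceptron is trained on these difference vectors.

        import numpy as np
        from sklearn.neural_network import MLPClassifier

        def shallow_features(sentence: str) -> np.ndarray:
            words = sentence.split()
            # Toy surface statistics: length, average word length, comma count.
            return np.array([len(words),
                             sum(len(w) for w in words) / max(len(words), 1),
                             sentence.count(",")], dtype=float)

        # Hypothetical (sentence_a, sentence_b, a_is_more_fluent) training triples.
        pairs = [
            ("The committee approved the budget yesterday .", "committee the budget approved the .", 1),
            ("He quickly read the report before the meeting .", "He the report quickly before read .", 1),
            ("report meeting the before read .", "He read the report before the meeting .", 0),
            ("budget the approved yesterday .", "The committee approved the budget today .", 0),
        ]
        X = np.array([shallow_features(a) - shallow_features(b) for a, b, _ in pairs])
        y = np.array([label for _, _, label in pairs])

        clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0).fit(X, y)
        print(clf.predict(X))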

    Improving the Estimation of Word Importance for News Multi-Document Summarization - Extended Technical Report

    In this paper, we propose a supervised model for ranking word importance that incorporates a rich set of features. Our model is superior to prior approaches at identifying words used in human summaries. Moreover, we show that an extractive summarizer that incorporates our word-importance estimates produces summaries comparable to the state of the art under automatic evaluation.
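
    To illustrate the general set-up of supervised word-importance estimation (not the paper's features or model), the sketch below trains a classifier whose predicted probability that a word appears in a human summary serves as its importance score; the features and training examples are assumptions for demonstration only.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def word_features(word: str, doc_freq: int, first_position: float) -> np.ndarray:
            # Toy features: document frequency in the cluster, earliest relative
            # position at which the word occurs, and word length.
            return np.array([doc_freq, first_position, len(word)], dtype=float)

        # Hypothetical examples: (word, doc frequency, first position, used in human summary?).
        examples = [
            ("earthquake", 9, 0.02, 1),
            ("magnitude",  7, 0.05, 1),
            ("reportedly", 2, 0.40, 0),
            ("yesterday",  3, 0.10, 0),
        ]
        X = np.array([word_features(w, df, pos) for w, df, pos, _ in examples])
        y = np.array([label for *_, label in examples])

        model = LogisticRegression().fit(X, y)
        # Importance score for a new word = probability it would appear in a human summary.
        print(model.predict_proba(word_features("aftershock", 6, 0.08).reshape(1, -1))[:, 1])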